2022-10-03
Train a Machine learning Algorithm to predict Enzyme Comission Number (EC -number ) from Protein Sequence.
| Class | Reaction catalyzed | Typical reaction | Enzyme example |
|---|---|---|---|
| EC 1 Oxidoreductases | Oxidation/reduction reactions; transfer of H and O atoms or electrons from one substance to another | AH + B → A + BH (reduced) A + O → AO (oxidized) | Dehydrogenase, oxidase |
| EC 2 Transferases | Transfer of a functional groupf from one substance to another. The group may be methyl-, acyl-, amino- or phosphate group | AB + C → A + BC | Transaminase, kinase |
| EC 3 Hydrolases | Formation of two products from a substrate by hydrolysis | AB + H2O → AOH + BH | Lipase, amylase, peptidase, phosphatase |
| EC 4 Lyases | Non-hydrolytic addition or removal of groups from substrates. C-C, C-N, C-O or C-S bonds may be cleaved | RCOCOOH → RCOH + CO2 or [X-A+B-Y] → [A=B + X-Y] | Decarboxylase |
| EC 5 Isomerases | Intramolecule rearrangement, i.e. isomerization changes within a single molecule | ABC → BCA | Isomerase, mutase |
| EC 6 Ligases | Join together two molecules by synthesis of new C-O, C-S, C-N or C-C bonds with simultaneous breakdown of ATP | X + Y + ATP → XY + ADP + Pi | Synthetase |
| EC 7 Translocases | Catalyse the movement of ions or molecules across membranes or their separation within membranes | Transporter |
(from https://www.abbexa.com/enzyme-commission-number altered)
SAR -paradoxon: SAR: _Structure–Activity Relationship … refers to the fact that it is not the case that all similar molecules have similar activities.
Sequence > Structure > Function
Sequence > (ML) > Function
To work on the implementation of the algorithm we queried the Uniprot Database for reviewed human enzyme sequences with know EC numbers, using their REST-Api.
To create a larger training and validation dataset we used the training and validation data that had been used in the creation of ECPred a Java application for ECN prediction. The benefits in using this dataset was that it is already curated for machine learning, is very vast. The drawback was that the repository only contains UniProt accession numbers and not the sequences of proteins. For this reason we had to download the sequences and build the dataset ourselfs afterwards.
To keep the scope of the project reasonlable we focused only on the prediction of the main enzyme identifiier number EC{1,2,3,4,5,6}
# create directories
for(i in c("uniprot","expasy")){
dir.create(path = paste0("data/raw/",i),recursive = TRUE)
}
# Download fasta files using wget:
for( i in 1:7){
url<-paste0("https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28taxonomy_id%3A9606%29%20AND%20%28ec%3A",i,"%29%29%20AND%20%28reviewed%3Atrue%29")
dest<-paste0("data/raw/uniprot/EC_",i,"_HomoSapiens")
download.file(url,method = 'wget',destfile = dest)
}
list.files("data/raw/uniprot/")
# read fasta Files using seqinr- package:
EC_1<-seqinr::read.fasta("data/raw/uniprot/EC_1_HomoSapiens",
seqtype = "AA",
seqonly = FALSE)
EC_2<-seqinr::read.fasta("data/raw/uniprot/EC_1_HomoSapiens",
seqtype = "AA",
seqonly = TRUE)
# -> this gives us a list of S3 types "SeqFastaAA"
summary(EC_1)|>head(5)|>knitr::kable()
SeqFastaAA_list_to_matrix <- function(SeqFastaAA_list) {
statslist<-empty.dump()
## get AA-Composition list from fasta:
for(i in 1:length(SeqFastaAA_list)){
statslist[[i]]<-AAstat(SeqFastaAA_list[[i]],plot = FALSE)$Compo
}
## Set names:
names(statslist)<-names(SeqFastaAA_list)
## transform to matrix:
AA_matrix<-do.call(rbind,statslist)
## use three letter annotation:
colnames(AA_matrix)[-1]<-aaa(colnames(AA_matrix)[-1])
return(AA_matrix)
}
SeqFastaAA_list_to_matrix(EC_1)|>knitr::kable()
SeqFastaAA_list_to_matrix(EC_1)|>head(5)|>rowSums()|>knitr::kable()
EC_list<-empty.dump()
for(i in 1:7){
EC_list[[i]]<-seqinr::read.fasta(
paste0("data/raw/uniprot/EC_",i,"_HomoSapiens"),
seqtype = "AA",
seqonly = FALSE)
}
EC_matrix_list<-lapply(EC_list,SeqFastaAA_list_to_matrix)
# Add EC number:
names(EC_matrix_list)<-c(paste("EC",1:7))
## I call cbind.data.frame to keep the cells as int (cbind converts to character)
for(i in 1:7){
EC_matrix_list[[i]]<- cbind.data.frame( EC_matrix_list[[i]],"EC"=names(EC_matrix_list)[i])
}
# Bind rows to one big df:
comp_matrix<-do.call(rbind,EC_matrix_list)
saveRDS(comp_matrix,"data/comp_matrix_HSapiens.rds")
comp_matrix<-readRDS("data/comp_matrix_HSapiens.rds")
comp_matrix$EC<-as.factor(comp_matrix$EC)
head(comp_matrix)
## * Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met ## EC 1.sp|A0A087X1C5|CP2D7_HUMAN 0 42 9 23 28 31 36 16 16 15 66 11 ## EC 1.sp|A1KZ92|PXDNL_HUMAN 0 111 46 76 80 58 91 48 63 52 145 23 ## EC 1.sp|A2RUC4|TYW5_HUMAN 0 20 3 21 23 20 15 8 13 26 32 5 ## EC 1.sp|A5PLL7|PDES1_HUMAN 0 20 8 12 17 13 18 16 18 10 35 3 ## EC 1.sp|B2RXH2|KDM4E_HUMAN 0 33 12 17 31 20 45 20 18 20 41 15 ## EC 1.sp|C9JRZ8|AK1BF_HUMAN 0 21 3 19 27 21 14 10 19 31 27 7 ## Asn Pro Gln Arg Ser Thr Val Trp Tyr EC ## EC 1.sp|A0A087X1C5|CP2D7_HUMAN 11 40 23 41 25 25 44 7 6 EC 1 ## EC 1.sp|A1KZ92|PXDNL_HUMAN 62 96 79 103 95 92 87 21 35 EC 1 ## EC 1.sp|A2RUC4|TYW5_HUMAN 8 17 15 18 19 10 23 4 15 EC 1 ## EC 1.sp|A5PLL7|PDES1_HUMAN 7 14 9 14 9 15 13 13 6 EC 1 ## EC 1.sp|B2RXH2|KDM4E_HUMAN 18 32 30 30 35 35 24 10 20 EC 1 ## EC 1.sp|C9JRZ8|AK1BF_HUMAN 10 17 11 11 14 16 22 6 10 EC 1
### Plot number of enzymes per group comp_matrix %>% group_by(EC) %>% tally() %>% ggplot(aes(EC,n))+geom_col()
length of sequence per group
#### Plot length of sequence per group
AA_len<- cbind("EC"=comp_matrix$EC,"nAA"=rowSums(comp_matrix[1:21]))
boxplot(nAA ~ EC,data = AA_len)
## -> big outlier in EC 2
PCR - just for fun…
Key component in the assembly and functioning of vertebrate striated muscles. By providing connections at the level of individual microfilaments, it contributes to the fine balance of forces between the two halves of the sarcomere. The size and extensibility of the cross-links are the main determinants of sarcomere extensibility properties of muscle. In non-muscle cells, seems to play a role in chromosome condensation and chromosome segregation during mitosis. Might link the lamina network to chromatin or nuclear actin, or both during interphase.
comp_matrix %>% filter(row.names(comp_matrix) %in% c("EC 2.sp|Q8WZ42|TITIN_HUMAN"))
## * Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met ## EC 2.sp|Q8WZ42|TITIN_HUMAN 0 2084 513 1720 3193 908 2066 478 2062 2943 2117 398 ## Asn Pro Gln Arg Ser Thr Val Trp Tyr EC ## EC 2.sp|Q8WZ42|TITIN_HUMAN 1111 2517 942 1640 2463 2546 3184 466 999 EC 2
# just a giant enzyme....... -----
only Amino acid composition as features baseline
library(protr)
ids <- c("P00750", "P00751", "P00752")
prots <- getUniProt(ids)
prots <- removeGaps(prots)
prots <- subset(prots,unlist(lapply(prots, protcheck)))
prots# remove problematic sequences
prots <- lapply(prots,extractAAC ) #calcualte AAC
rm -r ECPred git clone --depth 1 --filter=blob:none --sparse https://github.com/cansyl/ECPred cd ECPred git sparse-checkout set ECPred\ Datasets
cd ./ECPred/ECPred\ Datasets #rm -rf !(EC_Main*) tree
## [01;34m.[0m ## ├── [01;34mEC_MainClass_NegativeTrain[0m ## │ ├── [00mHydrolases_Negative_Train.txt[0m ## │ ├── [00mIsomerases_Negative_Train.txt[0m ## │ ├── [00mLigases_Negative_Train.txt[0m ## │ ├── [00mLyases_Negative_Train.txt[0m ## │ ├── [00mOxidoreductases_Negative_Train.txt[0m ## │ └── [00mTransferases_Negative_Train.txt[0m ## ├── [01;34mEC_MainClass_NegativeValidation[0m ## │ ├── [00mHydrolases_Negative_Validation.txt[0m ## │ ├── [00mIsomerases_Negative_Validation.txt[0m ## │ ├── [00mLigases_Negative_Validation.txt[0m ## │ ├── [00mLyases_Negative_Validation.txt[0m ## │ ├── [00mOxidoreductases_Negative_Validation.txt[0m ## │ └── [00mTransferases_Negative_Validation.txt[0m ## ├── [01;34mEC_MainClass_PositiveTrain[0m ## │ ├── [00mHydrolases_Positive_Train.Txt[0m ## │ ├── [00mIsomerases_Positive_Train.txt[0m ## │ ├── [00mLigases_Positive_Train.txt[0m ## │ ├── [00mLyases_Positive_Train.txt[0m ## │ ├── [00mOxidoreductases_Positive_Train.txt[0m ## │ └── [00mTransferases_Positive_Train.txt[0m ## └── [01;34mEC_MainClass_PositiveValidation[0m ## ├── [00mHydrolases_Positive_Validation.txt[0m ## ├── [00mIsomerases_Positive_Validation.txt[0m ## ├── [00mLigases_Positive_Validation.txt[0m ## ├── [00mLyases_Positive_Validation.txt[0m ## ├── [00mOxidoreductases_Positive_Validation.txt[0m ## └── [00mTransferases_Positive_Validation.txt[0m ## ## 4 directories, 24 files